Colaboratory logo

What is Google Colaboratory?

Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with

  • Zero configuration required
  • Free access to GPUs
  • Easy sharing

Whether you're a student, a data scientist or an AI researcher, Colab can make your work easier. Watch Introduction to Colab to learn more, or just get started below!

Introduction

We will create a movie recommendation system based on the MovieLens dataset available here. The data consists of movies ratings (on a scale of 1 to 5).

In [8]:
import nbconvert
In [ ]:
nbconvert
In [20]:
!jupyter nbconvert --to html HostedTemplateRecSysAug17.ipynb
[NbConvertApp] WARNING | pattern u'HostedTemplateRecSysAug17.ipynb' matched no files
This application is used to convert notebook files (*.ipynb) to various other
formats.

WARNING: THE COMMANDLINE INTERFACE MAY CHANGE IN FUTURE RELEASES.

Options
-------

Arguments that take values are actually convenience aliases to full
Configurables, whose aliases are listed on the help line. For more information
on full configurables, see '--help-all'.

--execute
    Execute the notebook prior to export.
--allow-errors
    Continue notebook execution even if one of the cells throws an error and include the error message in the cell output (the default behaviour is to abort conversion). This flag is only relevant if '--execute' was specified, too.
--no-input
    Exclude input cells and output prompts from converted document. 
    This mode is ideal for generating code-free reports.
--stdout
    Write notebook output to stdout instead of files.
--stdin
    read a single notebook file from stdin. Write the resulting notebook with default basename 'notebook.*'
--inplace
    Run nbconvert in place, overwriting the existing notebook (only 
    relevant when converting to notebook format)
-y
    Answer yes to any questions instead of prompting.
--clear-output
    Clear output of current file and save in place, 
    overwriting the existing notebook.
--debug
    set log level to logging.DEBUG (maximize logging output)
--no-prompt
    Exclude input and output prompts from converted document.
--generate-config
    generate default config file
--nbformat=<Enum> (NotebookExporter.nbformat_version)
    Default: 4
    Choices: [1, 2, 3, 4]
    The nbformat version to write. Use this to downgrade notebooks.
--output-dir=<Unicode> (FilesWriter.build_directory)
    Default: ''
    Directory to write output(s) to. Defaults to output to the directory of each
    notebook. To recover previous default behaviour (outputting to the current
    working directory) use . as the flag value.
--writer=<DottedObjectName> (NbConvertApp.writer_class)
    Default: 'FilesWriter'
    Writer class used to write the  results of the conversion
--log-level=<Enum> (Application.log_level)
    Default: 30
    Choices: (0, 10, 20, 30, 40, 50, 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL')
    Set the log level by value or name.
--reveal-prefix=<Unicode> (SlidesExporter.reveal_url_prefix)
    Default: u''
    The URL prefix for reveal.js (version 3.x). This defaults to the reveal CDN,
    but can be any url pointing to a copy  of reveal.js.
    For speaker notes to work, this must be a relative path to a local  copy of
    reveal.js: e.g., "reveal.js".
    If a relative path is given, it must be a subdirectory of the current
    directory (from which the server is run).
    See the usage documentation
    (https://nbconvert.readthedocs.io/en/latest/usage.html#reveal-js-html-
    slideshow) for more details.
--to=<Unicode> (NbConvertApp.export_format)
    Default: 'html'
    The export format to be used, either one of the built-in formats
    ['asciidoc', 'custom', 'html', 'latex', 'markdown', 'notebook', 'pdf',
    'python', 'rst', 'script', 'slides'] or a dotted object name that represents
    the import path for an `Exporter` class
--template=<Unicode> (TemplateExporter.template_file)
    Default: u''
    Name of the template file to use
--output=<Unicode> (NbConvertApp.output_base)
    Default: ''
    overwrite base name use for output files. can only be used when converting
    one notebook at a time.
--post=<DottedOrNone> (NbConvertApp.postprocessor_class)
    Default: u''
    PostProcessor class used to write the results of the conversion
--config=<Unicode> (JupyterApp.config_file)
    Default: u''
    Full path of a config file.

To see all available configurables, use `--help-all`

Examples
--------

    The simplest way to use nbconvert is
    
    > jupyter nbconvert mynotebook.ipynb
    
    which will convert mynotebook.ipynb to the default format (probably HTML).
    
    You can specify the export format with `--to`.
    Options include ['asciidoc', 'custom', 'html', 'latex', 'markdown', 'notebook', 'pdf', 'python', 'rst', 'script', 'slides'].
    
    > jupyter nbconvert --to latex mynotebook.ipynb
    
    Both HTML and LaTeX support multiple output templates. LaTeX includes
    'base', 'article' and 'report'.  HTML includes 'basic' and 'full'. You
    can specify the flavor of the format used.
    
    > jupyter nbconvert --to html --template basic mynotebook.ipynb
    
    You can also pipe the output to stdout, rather than a file
    
    > jupyter nbconvert mynotebook.ipynb --stdout
    
    PDF is generated via latex
    
    > jupyter nbconvert mynotebook.ipynb --to pdf
    
    You can get (and serve) a Reveal.js-powered slideshow
    
    > jupyter nbconvert myslides.ipynb --to slides --post serve
    
    Multiple notebooks can be given at the command line in a couple of 
    different ways:
    
    > jupyter nbconvert notebook*.ipynb
    > jupyter nbconvert notebook1.ipynb notebook2.ipynb
    
    or you can specify the notebooks list in a config file, containing::
    
        c.NbConvertApp.notebooks = ["my_notebook.ipynb"]
    
    > jupyter nbconvert --config mycfg.py

Hit the left-side â–º Play Button to run the code (Double-click this text to see all imported modules, and double-click again to hide)

In [2]:
# @title Imports (Double-click this text to see all imported modules, and double-click this text again again to hide)

from __future__ import print_function

print("loading modules and packages from web")
import numpy as np
import pandas as pd
import collections
from mpl_toolkits.mplot3d import Axes3D
from IPython import display
from matplotlib import pyplot as plt
import sklearn
import sklearn.manifold
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
tf.logging.set_verbosity(tf.logging.ERROR)

# Add some convenience functions to Pandas DataFrame.
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.3f}'.format
def mask(df, key, function):
  """Returns a filtered dataframe, by applying function to key"""
  return df[function(df[key])]

def flatten_cols(df):
  df.columns = [' '.join(col).strip() for col in df.columns.values]
  return df

pd.DataFrame.mask = mask
pd.DataFrame.flatten_cols = flatten_cols

# Install Altair and activate its colab renderer.
print("Installing Altair...")
!pip install git+git://github.com/altair-viz/altair.git
import altair as alt
alt.data_transformers.enable('default', max_rows=None)
alt.renderers.enable('colab')
print("Done installing Altair.")

# Install google spreadsheets and import authentication module.
USER_RATINGS = False
!pip install --upgrade -q gspread
from google.colab import auth
import gspread
from oauth2client.client import GoogleCredentials
loading modules and packages from web
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Installing Altair...
Collecting git+git://github.com/altair-viz/altair.git
  Cloning git://github.com/altair-viz/altair.git to /tmp/pip-req-build-to9nkgjc
  Running command git clone -q git://github.com/altair-viz/altair.git /tmp/pip-req-build-to9nkgjc
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Requirement already satisfied: jsonschema in /usr/local/lib/python3.6/dist-packages (from altair==4.2.0.dev0) (2.6.0)
Requirement already satisfied: toolz in /usr/local/lib/python3.6/dist-packages (from altair==4.2.0.dev0) (0.10.0)
Requirement already satisfied: pandas>=0.18 in /usr/local/lib/python3.6/dist-packages (from altair==4.2.0.dev0) (1.0.5)
Requirement already satisfied: entrypoints in /usr/local/lib/python3.6/dist-packages (from altair==4.2.0.dev0) (0.3)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.6/dist-packages (from altair==4.2.0.dev0) (2.11.2)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from altair==4.2.0.dev0) (1.18.5)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.18->altair==4.2.0.dev0) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.18->altair==4.2.0.dev0) (2018.9)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from jinja2->altair==4.2.0.dev0) (1.1.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas>=0.18->altair==4.2.0.dev0) (1.15.0)
Building wheels for collected packages: altair
  Building wheel for altair (PEP 517) ... done
  Created wheel for altair: filename=altair-4.2.0.dev0-cp36-none-any.whl size=729978 sha256=5fecf31d3dacd0ac1993d9155722ff2b7f911cfda9872f1295fcea51caae73ce
  Stored in directory: /tmp/pip-ephem-wheel-cache-rjezwd4i/wheels/01/fd/91/025b6149b3949af76e93b3b3ceca5bf12cbdebc98fa46f9ec6
Successfully built altair
Installing collected packages: altair
  Found existing installation: altair 4.1.0
    Uninstalling altair-4.1.0:
      Successfully uninstalled altair-4.1.0
Successfully installed altair-4.2.0.dev0
Done installing Altair.

We then download the MovieLens Data, and create DataFrames containing movies, users, and ratings.

In [3]:
# @title Load the MovieLens data (run this cell).

# Download MovieLens data.
print("Downloading movielens data...")
from urllib.request import urlretrieve
import zipfile

urlretrieve("http://files.grouplens.org/datasets/movielens/ml-100k.zip", "movielens.zip")
zip_ref = zipfile.ZipFile('movielens.zip', "r")
zip_ref.extractall()
print("Done. Dataset contains:")
print(zip_ref.read('ml-100k/u.info'))

# Load each data set (users, movies, and ratings).
users_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv(
    'ml-100k/u.user', sep='|', names=users_cols, encoding='latin-1')

ratings_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(
    'ml-100k/u.data', sep='\t', names=ratings_cols, encoding='latin-1')

# The movies file contains a binary feature for each genre.
genre_cols = [
    "genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
]
movies_cols = [
    'movie_id', 'title', 'release_date', "video_release_date", "imdb_url"
] + genre_cols
movies = pd.read_csv(
    'ml-100k/u.item', sep='|', names=movies_cols, encoding='latin-1')

# Since the ids start at 1, we shift them to start at 0.
users["user_id"] = users["user_id"].apply(lambda x: str(x-1))
movies["movie_id"] = movies["movie_id"].apply(lambda x: str(x-1))
movies["year"] = movies['release_date'].apply(lambda x: str(x).split('-')[-1])
ratings["movie_id"] = ratings["movie_id"].apply(lambda x: str(x-1))
ratings["user_id"] = ratings["user_id"].apply(lambda x: str(x-1))
ratings["rating"] = ratings["rating"].apply(lambda x: float(x))

# Compute the number of movies to which a genre is assigned.
genre_occurences = movies[genre_cols].sum().to_dict()

# Since some movies can belong to more than one genre, we create different
# 'genre' columns as follows:
# - all_genres: all the active genres of the movie.
# - genre: randomly sampled from the active genres.
def mark_genres(movies, genres):
  def get_random_genre(gs):
    active = [genre for genre, g in zip(genres, gs) if g==1]
    if len(active) == 0:
      return 'Other'
    return np.random.choice(active)
  def get_all_genres(gs):
    active = [genre for genre, g in zip(genres, gs) if g==1]
    if len(active) == 0:
      return 'Other'
    return '-'.join(active)
  movies['genre'] = [
      get_random_genre(gs) for gs in zip(*[movies[genre] for genre in genres])]
  movies['all_genres'] = [
      get_all_genres(gs) for gs in zip(*[movies[genre] for genre in genres])]

mark_genres(movies, genre_cols)

# Create one merged DataFrame containing all the movielens data.
movielens = ratings.merge(movies, on='movie_id').merge(users, on='user_id')

# Utility to split the data into training and test sets.
def split_dataframe(df, holdout_fraction=0.1):
  """Splits a DataFrame into training and test sets.
  Args:
    df: a dataframe.
    holdout_fraction: fraction of dataframe rows to use in the test set.
  Returns:
    train: dataframe for training
    test: dataframe for testing
  """
  test = df.sample(frac=holdout_fraction, replace=False)
  train = df[~df.index.isin(test.index)]
  return train, test
Downloading movielens data...
Done. Dataset contains:
b'943 users\n1682 items\n100000 ratings\n'

I. Exploring the Movielens Data

Before we dive into model building, let's inspect our MovieLens dataset. It is usually helpful to understand the statistics of the dataset.

In [ ]:
movielens.head()

Users

We start by printing some basic statistics describing the numeric user features.

In [ ]:
users.head()
In [ ]:
users.describe()

We can also print some basic statistics describing the categorical user features

In [ ]:
users.describe(include=[np.object])

We can also create histograms to further understand the distribution of the users. We use Altair to create an interactive chart.

In [4]:
# @title Altair visualization code (run this cell)
# The following functions are used to generate interactive Altair charts.
# We will display histograms of the data, sliced by a given attribute.

# Create filters to be used to slice the data.
occupation_filter = alt.selection_multi(fields=["occupation"])
occupation_chart = alt.Chart().mark_bar().encode(
    x="count()",
    y=alt.Y("occupation:N"),
    color=alt.condition(
        occupation_filter,
        alt.Color("occupation:N", scale=alt.Scale(scheme='category20')),
        alt.value("lightgray")),
).properties(width=300, height=300, selection=occupation_filter)

# A function that generates a histogram of filtered data.
def filtered_hist(field, label, filter):
  """Creates a layered chart of histograms.
  The first layer (light gray) contains the histogram of the full data, and the
  second contains the histogram of the filtered data.
  Args:
    field: the field for which to generate the histogram.
    label: String label of the histogram.
    filter: an alt.Selection object to be used to filter the data.
  """
  base = alt.Chart().mark_bar().encode(
      x=alt.X(field, bin=alt.Bin(maxbins=10), title=label),
      y="count()",
  ).properties(
      width=300,
  )
  return alt.layer(
      base.transform_filter(filter),
      base.encode(color=alt.value('lightgray'), opacity=alt.value(.7)),
  ).resolve_scale(y='independent')

Next, we look at the distribution of ratings per user. Clicking on an occupation in the right chart will filter the data by that occupation. The corresponding histogram is shown in blue, and superimposed with the histogram for the whole data (in light gray). You can use SHIFT+click to select multiple subsets.

What do you observe, and how might this affect the recommendations?

In [6]:
#@title Distribution of User Ratings 
users_ratings = (
    ratings
    .groupby('user_id', as_index=False)
    .agg({'rating': ['count', 'mean']})
    .flatten_cols()
    .merge(users, on='user_id')
)

# Create a chart for the count, and one for the mean.
alt.hconcat(
    filtered_hist('rating count', '# ratings / user', occupation_filter),
    filtered_hist('rating mean', 'mean user rating', occupation_filter),
    occupation_chart,
    data=users_ratings)
Out[6]:

Movies

It is also useful to look at information about the movies and their ratings.

In [ ]:
#@title Movie filter and load
movies_ratings = movies.merge(
    ratings
    .groupby('movie_id', as_index=False)
    .agg({'rating': ['count', 'mean']})
    .flatten_cols(),
    on='movie_id')

genre_filter = alt.selection_multi(fields=['genre'])
genre_chart = alt.Chart().mark_bar().encode(
    x="count()",
    y=alt.Y('genre'),
    color=alt.condition(
        genre_filter,
        alt.Color("genre:N"),
        alt.value('lightgray'))
).properties(height=300, selection=genre_filter)
In [ ]:
(movies_ratings[['title', 'rating count', 'rating mean']]
 .sort_values('rating count', ascending=False)
 .head(10))
In [ ]:
(movies_ratings[['title', 'rating count', 'rating mean']]
 .mask('rating count', lambda x: x > 20)
 .sort_values('rating mean', ascending=False)
 .head(10))
In [ ]:
# @title Distribution of movie ratings and average rating per movie.
alt.hconcat(
    filtered_hist('rating count', '# ratings / movie', genre_filter),
    filtered_hist('rating mean', 'mean movie rating', genre_filter),
    genre_chart,
    data=movies_ratings)
Out[ ]:

Exercise 1: Build a tf.SparseTensor representation of the Rating Matrix.

In this exercise, we'll write a function that maps from our ratings DataFrame to a tf.SparseTensor.

Hint: you can select the values of a given column of a Dataframe df using df['column_name'].values.

In [ ]:
#@title Build a tf.SparseTensor representation of the Rating Matrix
def build_rating_sparse_tensor(ratings_df):
  """
  Args:
    ratings_df: a pd.DataFrame with `user_id`, `movie_id` and `rating` columns.
  Returns:
    a tf.SparseTensor representing the ratings matrix.
  """
  indices = ratings_df[['user_id', 'movie_id']].values
  values = ratings_df['rating'].values
  return tf.SparseTensor(
      indices=indices,
      values=values,
      dense_shape=[users.shape[0], movies.shape[0]])
  
def sparse_mean_square_error(sparse_ratings, user_embeddings, movie_embeddings):
  """
  Args:
    sparse_ratings: A SparseTensor rating matrix, of dense_shape [N, M]
    user_embeddings: A dense Tensor U of shape [N, k] where k is the embedding
      dimension, such that U_i is the embedding of user i.
    movie_embeddings: A dense Tensor V of shape [M, k] where k is the embedding
      dimension, such that V_j is the embedding of movie j.
  Returns:
    A scalar Tensor representing the MSE between the true ratings and the
      model's predictions.
  """
  predictions = tf.gather_nd(
      tf.matmul(user_embeddings, movie_embeddings, transpose_b=True),
      sparse_ratings.indices)
  loss = tf.losses.mean_squared_error(sparse_ratings.values, predictions)
  return loss

def sparse_mean_square_error(sparse_ratings, user_embeddings, movie_embeddings):
  """
  Args:
    sparse_ratings: A SparseTensor rating matrix, of dense_shape [N, M]
    user_embeddings: A dense Tensor U of shape [N, k] where k is the embedding
      dimension, such that U_i is the embedding of user i.
    movie_embeddings: A dense Tensor V of shape [M, k] where k is the embedding
      dimension, such that V_j is the embedding of movie j.
  Returns:
    A scalar Tensor representing the MSE between the true ratings and the
      model's predictions.
  """
  predictions = tf.reduce_sum(
      tf.gather(user_embeddings, sparse_ratings.indices[:, 0]) *
      tf.gather(movie_embeddings, sparse_ratings.indices[:, 1]),
      axis=1)
  loss = tf.losses.mean_squared_error(sparse_ratings.values, predictions)
  return loss
In [ ]:
#@title Solution # mark for deletion
def sparse_mean_square_error(sparse_ratings, user_embeddings, movie_embeddings):
  """
  Args:
    sparse_ratings: A SparseTensor rating matrix, of dense_shape [N, M]
    user_embeddings: A dense Tensor U of shape [N, k] where k is the embedding
      dimension, such that U_i is the embedding of user i.
    movie_embeddings: A dense Tensor V of shape [M, k] where k is the embedding
      dimension, such that V_j is the embedding of movie j.
  Returns:
    A scalar Tensor representing the MSE between the true ratings and the
      model's predictions.
  """
  predictions = tf.gather_nd(
      tf.matmul(user_embeddings, movie_embeddings, transpose_b=True),
      sparse_ratings.indices)
  loss = tf.losses.mean_squared_error(sparse_ratings.values, predictions)
  return loss
In [ ]:
#@title Alternate Solution # mark for deletion
def sparse_mean_square_error(sparse_ratings, user_embeddings, movie_embeddings):
  """
  Args:
    sparse_ratings: A SparseTensor rating matrix, of dense_shape [N, M]
    user_embeddings: A dense Tensor U of shape [N, k] where k is the embedding
      dimension, such that U_i is the embedding of user i.
    movie_embeddings: A dense Tensor V of shape [M, k] where k is the embedding
      dimension, such that V_j is the embedding of movie j.
  Returns:
    A scalar Tensor representing the MSE between the true ratings and the
      model's predictions.
  """
  predictions = tf.reduce_sum(
      tf.gather(user_embeddings, sparse_ratings.indices[:, 0]) *
      tf.gather(movie_embeddings, sparse_ratings.indices[:, 1]),
      axis=1)
  loss = tf.losses.mean_squared_error(sparse_ratings.values, predictions)
  return loss

Exercise 3 (Optional): adding your own ratings to the data set

You have the option to add your own ratings to the data set. If you choose to do so, you will be able to see recommendations for yourself.

Start by checking the box below. Running the next cell will authenticate you to your google Drive account, and create a spreadsheet, that contains all movie titles in column 'A'. Follow the link to the spreadsheet and take 3 minutes to rate some of the movies. Your ratings should be entered in column 'B'.

In [ ]:
USER_RATINGS = True #@param {type:"boolean"}
In [ ]:
# @title Run to create a spreadsheet, then use it to enter your ratings.
# Authenticate user.
if USER_RATINGS:
  auth.authenticate_user()
  gc = gspread.authorize(GoogleCredentials.get_application_default())
  # Create the spreadsheet and print a link to it.
  try:
    sh = gc.open('MovieLens-test')
  except(gspread.SpreadsheetNotFound):
    sh = gc.create('MovieLens-test')

  worksheet = sh.sheet1
  titles = movies['title'].values
  cell_list = worksheet.range(1, 1, len(titles), 1)
  for cell, title in zip(cell_list, titles):
    cell.value = title
  worksheet.update_cells(cell_list)
  print("Link to the spreadsheet: "
        "https://docs.google.com/spreadsheets/d/{}/edit".format(sh.id))
Link to the spreadsheet: https://docs.google.com/spreadsheets/d/1ks87nuM_arJuhbT_IVJFiJzvUIk9hGFhlawxVPjT3hc/edit

Run the next cell to load your ratings and add them to the main ratings DataFrame.

In [ ]:
# @title Run to load your ratings.
# Load the ratings from the spreadsheet and create a DataFrame.
if USER_RATINGS:
  my_ratings = pd.DataFrame.from_records(worksheet.get_all_values()).reset_index()
  my_ratings = my_ratings[my_ratings[1] != '']
  my_ratings = pd.DataFrame({
      'user_id': "20",
      'movie_id': list(map(str, my_ratings['index'])),
      'rating': list(map(float, my_ratings[1])),
  })
  # Remove previous ratings.
  ratings = ratings[ratings.user_id != "20"]
  # Add new ratings.
  ratings = ratings.append(my_ratings, ignore_index=True)
  # Add new user to the users DataFrame.
  if users.shape[0] == 20:
    users = users.append(users.iloc[942], ignore_index=True)
    users["user_id"][20] = "20"
  print("Added your %d ratings; you have great taste!" % len(my_ratings))
  ratings[ratings.user_id=="20"].merge(movies[['movie_id', 'title']])
Added your 84 ratings; you have great taste!
In [ ]:
# @title CFModel helper class (run this cell)
class CFModel(object):
  """Simple class that represents a collaborative filtering model"""
  def __init__(self, embedding_vars, loss, metrics=None):
    """Initializes a CFModel.
    Args:
      embedding_vars: A dictionary of tf.Variables.
      loss: A float Tensor. The loss to optimize.
      metrics: optional list of dictionaries of Tensors. The metrics in each
        dictionary will be plotted in a separate figure during training.
    """
    self._embedding_vars = embedding_vars
    self._loss = loss
    self._metrics = metrics
    self._embeddings = {k: None for k in embedding_vars}
    self._session = None

  @property
  def embeddings(self):
    """The embeddings dictionary."""
    return self._embeddings

  def train(self, num_iterations=100, learning_rate=1.0, plot_results=True,
            optimizer=tf.train.GradientDescentOptimizer):
    """Trains the model.
    Args:
      iterations: number of iterations to run.
      learning_rate: optimizer learning rate.
      plot_results: whether to plot the results at the end of training.
      optimizer: the optimizer to use. Default to GradientDescentOptimizer.
    Returns:
      The metrics dictionary evaluated at the last iteration.
    """
    with self._loss.graph.as_default():
      opt = optimizer(learning_rate)
      train_op = opt.minimize(self._loss)
      local_init_op = tf.group(
          tf.variables_initializer(opt.variables()),
          tf.local_variables_initializer())
      if self._session is None:
        self._session = tf.Session()
        with self._session.as_default():
          self._session.run(tf.global_variables_initializer())
          self._session.run(tf.tables_initializer())
          tf.train.start_queue_runners()

    with self._session.as_default():
      local_init_op.run()
      iterations = []
      metrics = self._metrics or ({},)
      metrics_vals = [collections.defaultdict(list) for _ in self._metrics]

      # Train and append results.
      for i in range(num_iterations + 1):
        _, results = self._session.run((train_op, metrics))
        if (i % 10 == 0) or i == num_iterations:
          print("\r iteration %d: " % i + ", ".join(
                ["%s=%f" % (k, v) for r in results for k, v in r.items()]),
                end='')
          iterations.append(i)
          for metric_val, result in zip(metrics_vals, results):
            for k, v in result.items():
              metric_val[k].append(v)

      for k, v in self._embedding_vars.items():
        self._embeddings[k] = v.eval()

      if plot_results:
        # Plot the metrics.
        num_subplots = len(metrics)+1
        fig = plt.figure()
        fig.set_size_inches(num_subplots*10, 8)
        for i, metric_vals in enumerate(metrics_vals):
          ax = fig.add_subplot(1, num_subplots, i+1)
          for k, v in metric_vals.items():
            ax.plot(iterations, v, label=k)
          ax.set_xlim([1, num_iterations])
          ax.legend()
      return results

Exercise 4: Build a Matrix Factorization model and train it

Using your sparse_mean_square_error function, write a function that builds a CFModel by creating the embedding variables and the train and test losses.

In [ ]:
#@title Solution
def build_model(ratings, embedding_dim=3, init_stddev=1.):
  """
  Args:
    ratings: a DataFrame of the ratings
    embedding_dim: the dimension of the embedding vectors.
    init_stddev: float, the standard deviation of the random initial embeddings.
  Returns:
    model: a CFModel.
  """
  # Split the ratings DataFrame into train and test.
  train_ratings, test_ratings = split_dataframe(ratings)
  # SparseTensor representation of the train and test datasets.
  A_train = build_rating_sparse_tensor(train_ratings)
  A_test = build_rating_sparse_tensor(test_ratings)
  # Initialize the embeddings using a normal distribution.
  U = tf.Variable(tf.random_normal(
      [A_train.dense_shape[0], embedding_dim], stddev=init_stddev))
  V = tf.Variable(tf.random_normal(
      [A_train.dense_shape[1], embedding_dim], stddev=init_stddev))
  train_loss = sparse_mean_square_error(A_train, U, V)
  test_loss = sparse_mean_square_error(A_test, U, V)
  metrics = {
      'train_error': train_loss,
      'test_error': test_loss
  }
  embeddings = {
      "user_id": U,
      "movie_id": V
  }
  return CFModel(embeddings, train_loss, [metrics])

Great, now it's time to train the model!

Go ahead and run the next cell, trying different parameters (embedding dimension, learning rate, iterations). The training and test errors are plotted at the end of training. You can inspect these values to validate the hyper-parameters.

Note: by calling model.train again, the model will continue training starting from the current values of the embeddings.

In [ ]:
# Build the CF model and train it.
model = build_model(ratings, embedding_dim=30, init_stddev=0.5)
model.train(num_iterations=1000, learning_rate=10.)
 iteration 1000: train_error=0.376171, test_error=1.385726
Out[ ]:
[{'test_error': 1.3857261, 'train_error': 0.37617064}]

The movie and user embeddings are also displayed in the right figure. When the embedding dimension is greater than 3, the embeddings are projected on the first 3 dimensions. The next section will have a more detailed look at the embeddings.

IV. Inspecting the Embeddings

In this section, we take a closer look at the learned embeddings, by

  • computing your recommendations
  • looking at the nearest neighbors of some movies,
  • looking at the norms of the movie embeddings,
  • visualizing the embedding in a projected embedding space.
In [ ]:
#@title Solution
DOT = 'dot'
COSINE = 'cosine'
def compute_scores(query_embedding, item_embeddings, measure=DOT):
  """Computes the scores of the candidates given a query.
  Args:
    query_embedding: a vector of shape [k], representing the query embedding.
    item_embeddings: a matrix of shape [N, k], such that row i is the embedding
      of item i.
    measure: a string specifying the similarity measure to be used. Can be
      either DOT or COSINE.
  Returns:
    scores: a vector of shape [N], such that scores[i] is the score of item i.
  """
  u = query_embedding
  V = item_embeddings
  if measure == COSINE:
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    u = u / np.linalg.norm(u)
  scores = u.dot(V.T)
  return scores

# @title User recommendations and nearest neighbors (run this cell)
def user_recommendations(model, measure=DOT, exclude_rated=False, k=6):
  if USER_RATINGS:
    scores = compute_scores(
        model.embeddings["user_id"][20], model.embeddings["movie_id"], measure)
    score_key = measure + ' score'
    df = pd.DataFrame({
        score_key: list(scores),
        'movie_id': movies['movie_id'],
        'titles': movies['title'],
        'genres': movies['all_genres'],
    })
    if exclude_rated:
      # remove movies that are already rated
      rated_movies = ratings[ratings.user_id == "20"]["movie_id"].values
      df = df[df.movie_id.apply(lambda movie_id: movie_id not in rated_movies)]
    display.display(df.sort_values([score_key], ascending=False).head(k))  

def movie_neighbors(model, title_substring, measure=DOT, k=10):
  # Search for movie ids that match the given substring.
  ids =  movies[movies['title'].str.contains(title_substring)].index.values
  titles = movies.iloc[ids]['title'].values
  if len(titles) == 0:
    raise ValueError("Found no movies with title %s" % title_substring)
  print("Nearest neighbors of : %s." % titles[0])
  if len(titles) > 1:
    print("[Found more than one matching movie. Other candidates: {}]".format(
        ", ".join(titles[1:])))
  movie_id = ids[0]
  scores = compute_scores(
      model.embeddings["movie_id"][movie_id], model.embeddings["movie_id"],
      measure)
  score_key = measure + ' score'
  df = pd.DataFrame({
      score_key: list(scores),
      'titles': movies['title'],
      'genres': movies['all_genres']
  })
  display.display(df.sort_values([score_key], ascending=False).head(k))
In [ ]:
# @title User recommendations and nearest neighbors (run this cell)# mark for deletion
def user_recommendations(model, measure=DOT, exclude_rated=False, k=6):
  if USER_RATINGS:
    scores = compute_scores(
        model.embeddings["user_id"][20], model.embeddings["movie_id"], measure)
    score_key = measure + ' score'
    df = pd.DataFrame({
        score_key: list(scores),
        'movie_id': movies['movie_id'],
        'titles': movies['title'],
        'genres': movies['all_genres'],
    })
    if exclude_rated:
      # remove movies that are already rated
      rated_movies = ratings[ratings.user_id == "20"]["movie_id"].values
      df = df[df.movie_id.apply(lambda movie_id: movie_id not in rated_movies)]
    display.display(df.sort_values([score_key], ascending=False).head(k))  

def movie_neighbors(model, title_substring, measure=DOT, k=12):
  # Search for movie ids that match the given substring.
  ids =  movies[movies['title'].str.contains(title_substring)].index.values
  titles = movies.iloc[ids]['title'].values
  if len(titles) == 0:
    raise ValueError("Found no movies with title %s" % title_substring)
  print("Nearest neighbors of : %s." % titles[0])
  if len(titles) > 1:
    print("[Found more than one matching movie. Other candidates: {}]".format(
        ", ".join(titles[1:])))
  movie_id = ids[0]
  scores = compute_scores(
      model.embeddings["movie_id"][movie_id], model.embeddings["movie_id"],
      measure)
  score_key = measure + ' score'
  df = pd.DataFrame({
      score_key: list(scores),
      'titles': movies['title'],
      'genres': movies['all_genres']
  })
  display.display(df.sort_values([score_key], ascending=False).head(k))

Equipped with this function, we can compute recommendations, where the query embedding can be either a user embedding or a movie embedding.

Your recommendations

If you chose to input your recommendations, you can run the next cell to generate recommendations for you.

In [ ]:
user_recommendations(model, measure=COSINE, k=10, exclude_rated=False)
cosine score movie_id titles genres
221 0.701 221 Star Trek: First Contact (1996) Action-Adventure-Sci-Fi
379 0.692 379 Star Trek: Generations (1994) Action-Adventure-Sci-Fi
322 0.687 322 Dante's Peak (1997) Action-Thriller
635 0.683 635 Escape from New York (1981) Action-Adventure-Sci-Fi-Thriller
229 0.678 229 Star Trek IV: The Voyage Home (1986) Action-Adventure-Sci-Fi
684 0.675 684 Executive Decision (1996) Action-Thriller
209 0.674 209 Indiana Jones and the Last Crusade (1989) Action-Adventure
747 0.660 747 Saint, The (1997) Action-Romance-Thriller
844 0.657 844 That Thing You Do! (1996) Comedy
404 0.649 404 Mission: Impossible (1996) Action-Adventure-Mystery

How do the recommendations look?

Movie Nearest neighbors

Let's look at the neareast neighbors for some of the movies.

In [ ]:
movie_neighbors(model, "Aladdin", DOT)
movie_neighbors(model, "Aladdin", COSINE)
Nearest neighbors of : Aladdin (1992).
[Found more than one matching movie. Other candidates: Aladdin and the King of Thieves (1996)]
dot score titles genres
94 6.561 Aladdin (1992) Animation-Children-Comedy-Musical
1210 5.633 Blue Sky (1994) Drama-Romance
1011 5.385 Private Parts (1997) Comedy-Drama
1141 5.375 When We Were Kings (1996) Documentary
271 5.357 Good Will Hunting (1997) Drama
... ... ... ...
753 5.198 Red Corner (1997) Crime-Thriller
941 5.174 What's Love Got to Do with It (1993) Drama
193 5.173 Sting, The (1973) Comedy-Crime
173 5.170 Raiders of the Lost Ark (1981) Action-Adventure
1143 5.147 Quiet Room, The (1996) Drama

12 rows × 3 columns

Nearest neighbors of : Aladdin (1992).
[Found more than one matching movie. Other candidates: Aladdin and the King of Thieves (1996)]
cosine score titles genres
94 1.000 Aladdin (1992) Animation-Children-Comedy-Musical
753 0.816 Red Corner (1997) Crime-Thriller
193 0.795 Sting, The (1973) Comedy-Crime
281 0.794 Time to Kill, A (1996) Drama
470 0.788 Courage Under Fire (1996) Drama-War
... ... ... ...
70 0.773 Lion King, The (1994) Animation-Children-Musical
124 0.760 Phenomenon (1996) Drama-Romance
392 0.753 Mrs. Doubtfire (1993) Comedy
110 0.752 Truth About Cats & Dogs, The (1996) Comedy-Romance
434 0.748 Butch Cassidy and the Sundance Kid (1969) Action-Comedy-Western

12 rows × 3 columns

It seems that the quality of learned embeddings may not be very good. This will be addressed in Section V by adding several regularization techniques. First, we will further inspect the embeddings.

Movie Embedding Norm

We can also observe that the recommendations with dot-product and cosine are different: with dot-product, the model tends to recommend popular movies. This can be explained by the fact that in matrix factorization models, the norm of the embedding is often correlated with popularity (popular movies have a larger norm), which makes it more likely to recommend more popular items. We can confirm this hypothesis by sorting the movies by their embedding norm, as done in the next cell.

In [ ]:
# @title Embedding Visualization code (run this cell)

def movie_embedding_norm(models):
  """Visualizes the norm and number of ratings of the movie embeddings.
  Args:
    model: A MFModel object.
  """
  if not isinstance(models, list):
    models = [models]
  df = pd.DataFrame({
      'title': movies['title'],
      'genre': movies['genre'],
      'num_ratings': movies_ratings['rating count'],
  })
  charts = []
  brush = alt.selection_interval()
  for i, model in enumerate(models):
    norm_key = 'norm'+str(i)
    df[norm_key] = np.linalg.norm(model.embeddings["movie_id"], axis=1)
    nearest = alt.selection(
        type='single', encodings=['x', 'y'], on='mouseover', nearest=True,
        empty='none')
    base = alt.Chart().mark_circle().encode(
        x='num_ratings',
        y=norm_key,
        color=alt.condition(brush, alt.value('#4c78a8'), alt.value('lightgray'))
    ).properties(
        selection=nearest).add_selection(brush)
    text = alt.Chart().mark_text(align='center', dx=5, dy=-5).encode(
        x='num_ratings', y=norm_key,
        text=alt.condition(nearest, 'title', alt.value('')))
    charts.append(alt.layer(base, text))
  return alt.hconcat(*charts, data=df)

def visualize_movie_embeddings(data, x, y):
  nearest = alt.selection(
      type='single', encodings=['x', 'y'], on='mouseover', nearest=True,
      empty='none')
  base = alt.Chart().mark_circle().encode(
      x=x,
      y=y,
      color=alt.condition(genre_filter, "genre", alt.value("whitesmoke")),
  ).properties(
      width=600,
      height=600,
      selection=nearest)
  text = alt.Chart().mark_text(align='left', dx=5, dy=-5).encode(
      x=x,
      y=y,
      text=alt.condition(nearest, 'title', alt.value('')))
  return alt.hconcat(alt.layer(base, text), genre_chart, data=data)

def tsne_movie_embeddings(model):
  """Visualizes the movie embeddings, projected using t-SNE with Cosine measure.
  Args:
    model: A MFModel object.
  """
  tsne = sklearn.manifold.TSNE(
      n_components=2, perplexity=40, metric='cosine', early_exaggeration=10.0,
      init='pca', verbose=True, n_iter=400)

  print('Running t-SNE...')
  V_proj = tsne.fit_transform(model.embeddings["movie_id"])
  movies.loc[:,'x'] = V_proj[:, 0]
  movies.loc[:,'y'] = V_proj[:, 1]
  return visualize_movie_embeddings(movies, 'x', 'y')
In [ ]:
tsne_movie_embeddings(model)
Running t-SNE...
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 1682 samples in 0.000s...
[t-SNE] Computed neighbors for 1682 samples in 0.096s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1682
[t-SNE] Computed conditional probabilities for sample 1682 / 1682
[t-SNE] Mean sigma: 0.204633
[t-SNE] KL divergence after 250 iterations with early exaggeration: 57.427864
[t-SNE] KL divergence after 400 iterations: 2.704980
Out[ ]:
In [ ]:
movie_embedding_norm(model)
Out[ ]:

Note: Depending on how the model is initialized, you may observe that some niche movies (ones with few ratings) have a high norm, leading to spurious recommendations. This can happen if the embedding of that movie happens to be initialized with a high norm. Then, because the movie has few ratings, it is infrequently updated, and can keep its high norm. This will be alleviated by using regularization.

Try changing the value of the hyper-parameter init_stddev. One quantity that can be helpful is that the expected norm of a $d$-dimensional vector with entries $\sim \mathcal N(0, \sigma^2)$ is approximatley $\sigma \sqrt d$.

How does this affect the embedding norm distribution, and the ranking of the top-norm movies?

In [ ]:
#@title Solution
model_lowinit = build_model(ratings, embedding_dim=30, init_stddev=0.05)
model_lowinit.train(num_iterations=1000, learning_rate=10.)
movie_neighbors(model_lowinit, "Aladdin", DOT)
movie_neighbors(model_lowinit, "Aladdin", COSINE)
movie_embedding_norm([model, model_lowinit])
 iteration 1000: train_error=0.352223, test_error=0.968633Nearest neighbors of : Aladdin (1992).
[Found more than one matching movie. Other candidates: Aladdin and the King of Thieves (1996)]
dot score titles genres
94 5.833 Aladdin (1992) Animation-Children-Comedy-Musical
0 5.428 Toy Story (1995) Animation-Children-Comedy
704 4.942 Singin' in the Rain (1952) Musical-Romance
256 4.889 Men in Black (1997) Action-Adventure-Comedy-Sci-Fi
587 4.884 Beauty and the Beast (1991) Animation-Children-Musical
49 4.684 Star Wars (1977) Action-Adventure-Romance-Sci-Fi-War
171 4.674 Empire Strikes Back, The (1980) Action-Adventure-Drama-Romance-Sci-Fi-War
227 4.664 Star Trek: The Wrath of Khan (1982) Action-Adventure-Sci-Fi
70 4.653 Lion King, The (1994) Animation-Children-Musical
236 4.641 Jerry Maguire (1996) Drama-Romance
Nearest neighbors of : Aladdin (1992).
[Found more than one matching movie. Other candidates: Aladdin and the King of Thieves (1996)]
cosine score titles genres
94 1.000 Aladdin (1992) Animation-Children-Comedy-Musical
70 0.818 Lion King, The (1994) Animation-Children-Musical
704 0.812 Singin' in the Rain (1952) Musical-Romance
1226 0.811 Awfully Big Adventure, An (1995) Drama
587 0.806 Beauty and the Beast (1991) Animation-Children-Musical
469 0.805 Tombstone (1993) Western
1189 0.802 That Old Feeling (1997) Comedy-Romance
419 0.802 Alice in Wonderland (1951) Animation-Children-Musical
1414 0.801 Next Karate Kid, The (1994) Action-Children
730 0.800 Corrina, Corrina (1994) Comedy-Drama-Romance
Out[ ]:

Embedding visualization

Since it is hard to visualize embeddings in a higher-dimensional space (when the embedding dimension $k > 3$), one approach is to project the embeddings to a lower dimensional space. T-SNE (T-distributed Stochastic Neighbor Embedding) is an algorithm that projects the embeddings while attempting to preserve their pariwise distances. It can be useful for visualization, but one should use it with care. For more information on using t-SNE, see How to Use t-SNE Effectively.

In [ ]:
tsne_movie_embeddings(model_lowinit)
Running t-SNE...
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 1682 samples in 0.000s...
[t-SNE] Computed neighbors for 1682 samples in 0.093s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1682
[t-SNE] Computed conditional probabilities for sample 1682 / 1682
[t-SNE] Mean sigma: 0.117549
[t-SNE] KL divergence after 250 iterations with early exaggeration: 58.121090
[t-SNE] KL divergence after 400 iterations: 2.173100
Out[ ]:

You can highlight the embeddings of a given genre by clicking on the genres panel (SHIFT+click to select multiple genres).

We can observe that the embeddings do not seem to have any notable structure, and the embeddings of a given genre are located all over the embedding space. This confirms the poor quality of the learned embeddings. One of the main reasons, which we will address in the next section, is that we only trained the model on observed pairs, and without regularization.

V. Regularization In Matrix Factorization

In [ ]:
# @title Solution
def gravity(U, V):
  """Creates a gravity loss given two embedding matrices."""
  return 1. / (U.shape[0].value*V.shape[0].value) * tf.reduce_sum(
      tf.matmul(U, U, transpose_a=True) * tf.matmul(V, V, transpose_a=True))

def build_regularized_model(
    ratings, embedding_dim=3, regularization_coeff=.1, gravity_coeff=1.,
    init_stddev=0.1):
  """
  Args:
    ratings: the DataFrame of movie ratings.
    embedding_dim: The dimension of the embedding space.
    regularization_coeff: The regularization coefficient lambda.
    gravity_coeff: The gravity regularization coefficient lambda_g.
  Returns:
    A CFModel object that uses a regularized loss.
  """
  # Split the ratings DataFrame into train and test.
  train_ratings, test_ratings = split_dataframe(ratings)
  # SparseTensor representation of the train and test datasets.
  A_train = build_rating_sparse_tensor(train_ratings)
  A_test = build_rating_sparse_tensor(test_ratings)
  U = tf.Variable(tf.random_normal(
      [A_train.dense_shape[0], embedding_dim], stddev=init_stddev))
  V = tf.Variable(tf.random_normal(
      [A_train.dense_shape[1], embedding_dim], stddev=init_stddev))

  error_train = sparse_mean_square_error(A_train, U, V)
  error_test = sparse_mean_square_error(A_test, U, V)
  gravity_loss = gravity_coeff * gravity(U, V)
  regularization_loss = regularization_coeff * (
      tf.reduce_sum(U*U)/U.shape[0].value + tf.reduce_sum(V*V)/V.shape[0].value)
  total_loss = error_train + regularization_loss + gravity_loss
  losses = {
      'train_error_observed': error_train,
      'test_error_observed': error_test,
  }
  loss_components = {
      'observed_loss': error_train,
      'regularization_loss': regularization_loss,
      'gravity_loss': gravity_loss,
      
  }
  embeddings = {"user_id": U, "movie_id": V}

  return CFModel(embeddings, total_loss, [losses, loss_components])

It is now time to train the regularized model! You can try different values of the regularization coefficients, and different embedding dimensions.

In [ ]:
reg_model = build_regularized_model(
    ratings, regularization_coeff=0.1, gravity_coeff=1.0, embedding_dim=35,
    init_stddev=.05)
reg_model.train(num_iterations=2000, learning_rate=20.)
 iteration 2000: train_error_observed=1.001164, test_error_observed=2.461867, observed_loss=1.001164, regularization_loss=0.852175, gravity_loss=1.316355
Out[ ]:
[{'test_error_observed': 2.4618673, 'train_error_observed': 1.0011644},
 {'gravity_loss': 1.3163552,
  'observed_loss': 1.0011644,
  'regularization_loss': 0.85217494}]

Observe that adding the regularization terms results in a higher MSE, both on the training and test set. However, as we will see, the quality of the recommendations improves. This highlights a tension between fitting the observed data and minimizing the regularization terms. Fitting the observed data often emphasizes learning high similarity (between items with many interactions), but a good embedding representation also requires learning low similarity (between items with few or no interactions).

Inspect the results

Let's see if the results with regularization look better.

In [ ]:
user_recommendations(reg_model, DOT, exclude_rated=True, k=10)
dot score movie_id titles genres
171 4.614 171 Empire Strikes Back, The (1980) Action-Adventure-Drama-Romance-Sci-Fi-War
384 4.259 384 True Lies (1994) Action-Adventure-Comedy-Romance
173 4.176 173 Raiders of the Lost Ark (1981) Action-Adventure
407 3.765 407 Close Shave, A (1995) Animation-Comedy-Thriller
209 3.743 209 Indiana Jones and the Last Crusade (1989) Action-Adventure
187 3.618 187 Full Metal Jacket (1987) Action-Drama-War
402 3.611 402 Batman (1989) Action-Adventure-Crime-Drama
78 3.532 78 Fugitive, The (1993) Action-Thriller
113 3.516 113 Wallace & Gromit: The Best of Aardman Animatio... Animation
312 3.506 312 Titanic (1997) Action-Drama-Romance
In [ ]:
movie_neighbors(reg_model, "Aladdin", DOT)
movie_neighbors(reg_model, "Aladdin", COSINE)
Nearest neighbors of : Aladdin (1992).
[Found more than one matching movie. Other candidates: Aladdin and the King of Thieves (1996)]
dot score titles genres
94 9.617 Aladdin (1992) Animation-Children-Comedy-Musical
70 8.132 Lion King, The (1994) Animation-Children-Musical
171 7.919 Empire Strikes Back, The (1980) Action-Adventure-Drama-Romance-Sci-Fi-War
587 7.840 Beauty and the Beast (1991) Animation-Children-Musical
317 7.706 Schindler's List (1993) Drama-War
173 7.664 Raiders of the Lost Ark (1981) Action-Adventure
81 7.437 Jurassic Park (1993) Action-Adventure-Sci-Fi
0 7.343 Toy Story (1995) Animation-Children-Comedy
21 7.321 Braveheart (1995) Action-Drama-War
7 7.153 Babe (1995) Children-Comedy-Drama
Nearest neighbors of : Aladdin (1992).
[Found more than one matching movie. Other candidates: Aladdin and the King of Thieves (1996)]
cosine score titles genres
94 1.000 Aladdin (1992) Animation-Children-Comedy-Musical
70 0.877 Lion King, The (1994) Animation-Children-Musical
417 0.821 Cinderella (1950) Animation-Children-Musical
587 0.821 Beauty and the Beast (1991) Animation-Children-Musical
81 0.795 Jurassic Park (1993) Action-Adventure-Sci-Fi
7 0.736 Babe (1995) Children-Comedy-Drama
418 0.734 Mary Poppins (1964) Children-Comedy-Musical
98 0.713 Snow White and the Seven Dwarfs (1937) Animation-Children-Musical
227 0.700 Star Trek: The Wrath of Khan (1982) Action-Adventure-Sci-Fi
27 0.697 Apollo 13 (1995) Action-Drama-Thriller

Here we compare the embedding norms for model and reg_model. Selecting a subset of the embeddings will highlight them on both charts simultaneously.

In [ ]:
movie_embedding_norm([model, model_lowinit, reg_model])
Out[ ]:
In [ ]:
# Visualize the embeddings
tsne_movie_embeddings(reg_model)
Running t-SNE...
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 1682 samples in 0.002s...
[t-SNE] Computed neighbors for 1682 samples in 0.107s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1682
[t-SNE] Computed conditional probabilities for sample 1682 / 1682
[t-SNE] Mean sigma: 0.251844
[t-SNE] KL divergence after 250 iterations with early exaggeration: 58.368660
[t-SNE] KL divergence after 400 iterations: 1.505866
Out[ ]:

We should observe that the embeddings have a lot more structure than the unregularized case. Try selecting different genres and observe how they tend to form clusters (for example Horror, Animation and Children).

Conclusion

This concludes this section on matrix factorization models. Note that while the scale of the problem is small enough to allow efficient training using SGD, many practical problems need to be trained using more specialized algorithms such as Alternating Least Squares (see tf.contrib.factorization.WALSMatrixFactorization for a TF implementation).

VI. Softmax model

In this section, we will train a simple softmax model that predicts whether a given user has rated a movie.

Note: if you are taking the self-study version of the class, make sure to read through the part of the class covering Softmax training before working on this part.

The model will take as input a feature vector $x$ representing the list of movies the user has rated. We start from the ratings DataFrame, which we group by user_id.

In [ ]:
rated_movies = (ratings[["user_id", "movie_id"]]
                .groupby("user_id", as_index=False)
                .aggregate(lambda x: list(x)))
rated_movies 
Out[ ]:
user_id movie_id
0 0 [60, 188, 32, 159, 19, 201, 170, 264, 154, 116...
1 1 [291, 250, 49, 313, 296, 289, 311, 280, 12, 27...
2 10 [110, 557, 731, 226, 424, 739, 722, 37, 724, 1...
3 100 [828, 303, 595, 221, 470, 404, 280, 251, 281, ...
4 101 [767, 822, 69, 514, 523, 321, 624, 160, 447, 4...
... ... ...
938 95 [155, 86, 672, 478, 152, 90, 518, 7, 483, 644,...
939 96 [193, 227, 221, 669, 88, 481, 483, 97, 49, 114...
940 97 [46, 162, 516, 654, 320, 69, 208, 937, 115, 74...
941 98 [3, 267, 78, 110, 1015, 872, 402, 245, 273, 49...
942 99 [343, 353, 267, 320, 354, 749, 265, 287, 301, ...

943 rows × 2 columns

In [ ]:
rated_movies[rated_movies['user_id'] == "20"]
Out[ ]:
user_id movie_id
113 20 [0, 1, 3, 6, 7, 10, 11, 16, 20, 21, 23, 26, 27...

We then create a function that generates an example batch, such that each example contains the following features:

  • movie_id: A tensor of strings of the movie ids that the user rated.
  • genre: A tensor of strings of the genres of those movies
  • year: A tensor of strings of the release year.
In [ ]:
#@title Batch generation code (run this cell)
years_dict = {
    movie: year for movie, year in zip(movies["movie_id"], movies["year"])
}
genres_dict = {
    movie: genres.split('-')
    for movie, genres in zip(movies["movie_id"], movies["all_genres"])
}

def make_batch(ratings, batch_size):
  """Creates a batch of examples.
  Args:
    ratings: A DataFrame of ratings such that examples["movie_id"] is a list of
      movies rated by a user.
    batch_size: The batch size.
  """
  def pad(x, fill):
    return pd.DataFrame.from_dict(x).fillna(fill).values

  movie = []
  year = []
  genre = []
  label = []
  for movie_ids in ratings["movie_id"].values:
    movie.append(movie_ids)
    genre.append([x for movie_id in movie_ids for x in genres_dict[movie_id]])
    year.append([years_dict[movie_id] for movie_id in movie_ids])
    label.append([int(movie_id) for movie_id in movie_ids])
  features = {
      "movie_id": pad(movie, ""),
      "year": pad(year, ""),
      "genre": pad(genre, ""),
      "label": pad(label, -1)
  }
  batch = (
      tf.data.Dataset.from_tensor_slices(features)
      .shuffle(1000)
      .repeat()
      .batch(batch_size)
      .make_one_shot_iterator()
      .get_next())
  return batch

def select_random(x):
  """Selectes a random elements from each row of x."""
  def to_float(x):
    return tf.cast(x, tf.float32)
  def to_int(x):
    return tf.cast(x, tf.int64)
  batch_size = tf.shape(x)[0]
  rn = tf.range(batch_size)
  nnz = to_float(tf.count_nonzero(x >= 0, axis=1))
  rnd = tf.random_uniform([batch_size])
  ids = tf.stack([to_int(rn), to_int(nnz * rnd)], axis=1)
  return to_int(tf.gather_nd(x, ids))

Exercise 7: Write a loss function for the softmax model.

In this exercise, we will write a function that takes tensors representing the user embeddings $\psi(x)$, movie embeddings $V$, target label $y$, and return the cross-entropy loss.

Hint: You can use the function tf.nn.sparse_softmax_cross_entropy_with_logits, which takes logits as input, where logits refers to the product $\psi(x) V^\top$.

In [ ]:
# @title Solution
def softmax_loss(user_embeddings, movie_embeddings, labels):
  """Returns the cross-entropy loss of the softmax model.
  Args:
    user_embeddings: A tensor of shape [batch_size, embedding_dim].
    movie_embeddings: A tensor of shape [num_movies, embedding_dim].
    labels: A tensor of [batch_size], such that labels[i] is the target label
      for example i.
  Returns:
    The mean cross-entropy loss.
  """
  # Verify that the embddings have compatible dimensions
  user_emb_dim = user_embeddings.shape[1].value
  movie_emb_dim = movie_embeddings.shape[1].value
  if user_emb_dim != movie_emb_dim:
    raise ValueError(
        "The user embedding dimension %d should match the movie embedding "
        "dimension % d" % (user_emb_dim, movie_emb_dim))

  logits = tf.matmul(user_embeddings, movie_embeddings, transpose_b=True)
  loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
      logits=logits, labels=labels))
  return loss

Exercise 8: Build a softmax model, train it, and inspect its embeddings.

We are now ready to build a softmax CFModel. Complete the build_softmax_model function in the next cell. The architecture of the model is defined in the function create_user_embeddings and illustrated in the figure below. The input embeddings (movie_id, genre and year) are concatenated to form the input layer, then we have hidden layers with dimensions specified by the hidden_dims argument. Finally, the last hidden layer is multiplied by the movie embeddings to obtain the logits layer. For the target label, we will use a randomly-sampled movie_id from the list of movies the user rated.

Softmax model

Complete the function below by creating the feature columns and embedding columns, then creating the loss tensors both for the train and test sets (using the softmax_loss function of the previous exercise).

In [ ]:
def movie_neighbors2(model, title_substring, measure=DOT, k=12):
  # Search for movie ids that match the given substring.
  ids =  movies[movies['title'].str.contains(title_substring)].index.values
  titles = movies.iloc[ids]['title'].values
  if len(titles) == 0:
    raise ValueError("Found no movies with title %s" % title_substring)
  print("Nearest neighbors of : %s." % titles[0])
  if len(titles) > 1:
    print("[Found more than one matching movie. Other candidates: {}]".format(
        ", ".join(titles[1:])))
  movie_id = ids[0]
  scores = compute_scores(
      model.embeddings["movie_id"][movie_id], model.embeddings["movie_id"],
      measure)
  score_key = measure + ' score'
  df = pd.DataFrame({
      score_key: list(scores),
      'titles': movies['title'],
      'genres': movies['all_genres']
  })
  display.display(df.sort_values([score_key], ascending=False).head(k))
In [ ]:
# @title User recommendations and nearest neighbors (run this cell)
def user_recommendations2(model, measure=DOT, exclude_rated=False, k=6):
  if USER_RATINGS:
    scores = compute_scores(
        model.embeddings["user_id"][20], model.embeddings["movie_id"], measure)
    score_key = measure + ' score'
    df = pd.DataFrame({
        score_key: list(scores),
        'movie_id': movies['movie_id'],
        'titles': movies['title'],
        'genres': movies['all_genres'],
    })
    if exclude_rated:
      # remove movies that are already rated
      rated_movies = ratings[ratings.user_id == "20"]["movie_id"].values
      df = df[df.movie_id.apply(lambda movie_id: movie_id not in rated_movies)]
    display.display(df.sort_values([score_key], ascending=False).head(k))  
In [ ]:
# @title User recommendations and nearest neighbors (run this cell)
def user_recommendations3(model, measure=DOT, exclude_rated=False, k=6):
  if USER_RATINGS:
    scores = compute_scores(
        model.embeddings["user_idtest"][20], model.embeddings["movie_id"], measure)
    score_key = measure + ' score'
    df = pd.DataFrame({
        score_key: list(scores),
        'movie_id': movies['movie_id'],
        'titles': movies['title'],
        'genres': movies['all_genres'],
    })
    if exclude_rated:
      # remove movies that are already rated
      rated_movies = ratings[ratings.user_id == "20"]["movie_id"].values
      df = df[df.movie_id.apply(lambda movie_id: movie_id not in rated_movies)]
    display.display(df.sort_values([score_key], ascending=False).head(k))  
In [ ]:
user_recommendations3(softmax_model, COSINE, exclude_rated=True, k=10)
cosine score movie_id titles genres
402 0.774 402 Batman (1989) Action-Adventure-Crime-Drama
830 0.741 830 Escape from L.A. (1996) Action-Adventure-Sci-Fi-Thriller
567 0.701 567 Speed (1994) Action-Romance-Thriller
120 0.675 120 Independence Day (ID4) (1996) Action-Sci-Fi-War
824 0.673 824 Arrival, The (1996) Action-Sci-Fi-Thriller
229 0.672 229 Star Trek IV: The Voyage Home (1986) Action-Adventure-Sci-Fi
221 0.657 221 Star Trek: First Contact (1996) Action-Adventure-Sci-Fi
471 0.645 471 Dragonheart (1996) Action-Adventure-Fantasy
227 0.644 227 Star Trek: The Wrath of Khan (1982) Action-Adventure-Sci-Fi
454 0.642 454 Jackie Chan's First Strike (1996) Action
In [ ]:
user_recommendations3(softmax_model, DOT, exclude_rated=True, k=10)
dot score movie_id titles genres
173 6.614 173 Raiders of the Lost Ark (1981) Action-Adventure
482 6.560 482 Casablanca (1942) Drama-Romance-War
317 6.436 317 Schindler's List (1993) Drama-War
22 6.274 22 Taxi Driver (1976) Drama-Thriller
514 6.259 514 Boot, Das (1981) Action-Drama-War
478 6.199 478 Vertigo (1958) Mystery-Thriller
479 6.163 479 North by Northwest (1959) Comedy-Thriller
510 6.117 510 Lawrence of Arabia (1962) Adventure-War
473 6.070 473 Dr. Strangelove or: How I Learned to Stop Worr... Sci-Fi-War
522 6.067 522 Cool Hand Luke (1967) Comedy-Drama
In [ ]:
user_recommendations3(softmax_model, DOT, exclude_rated=False, k=10)
dot score movie_id titles genres
179 6.690 179 Apocalypse Now (1979) Drama-War
173 6.614 173 Raiders of the Lost Ark (1981) Action-Adventure
482 6.560 482 Casablanca (1942) Drama-Romance-War
317 6.436 317 Schindler's List (1993) Drama-War
55 6.388 55 Pulp Fiction (1994) Crime-Drama
21 6.362 21 Braveheart (1995) Action-Drama-War
186 6.318 186 Godfather: Part II, The (1974) Action-Crime-Drama
22 6.274 22 Taxi Driver (1976) Drama-Thriller
514 6.259 514 Boot, Das (1981) Action-Drama-War
478 6.199 478 Vertigo (1958) Mystery-Thriller
In [ ]:
user_recommendations2(reg_model, DOT, exclude_rated=True, k=10)
dot score movie_id titles genres
171 4.730 171 Empire Strikes Back, The (1980) Action-Adventure-Drama-Romance-Sci-Fi-War
173 4.514 173 Raiders of the Lost Ark (1981) Action-Adventure
256 4.471 256 Men in Black (1997) Action-Adventure-Comedy-Sci-Fi
209 4.006 209 Indiana Jones and the Last Crusade (1989) Action-Adventure
297 3.975 297 Face/Off (1997) Action-Sci-Fi-Thriller
78 3.890 78 Fugitive, The (1993) Action-Thriller
407 3.875 407 Close Shave, A (1995) Animation-Comedy-Thriller
402 3.869 402 Batman (1989) Action-Adventure-Crime-Drama
384 3.848 384 True Lies (1994) Action-Adventure-Comedy-Romance
317 3.490 317 Schindler's List (1993) Drama-War
In [ ]:
user_recommendations2(softmax_model, DOT, exclude_rated=True, k=10)
dot score movie_id titles genres
318 11.326 318 Everyone Says I Love You (1996) Comedy-Musical-Romance
324 10.867 324 Crash (1996) Drama-Thriller
323 10.689 323 Lost Highway (1997) Mystery
332 10.438 332 Game, The (1997) Mystery-Thriller
874 10.299 874 She's So Lovely (1997) Drama-Romance
990 10.156 990 Keys to Tulsa (1997) Crime
326 10.071 326 Cop Land (1997) Crime-Drama-Mystery
244 10.000 244 Devil's Own, The (1997) Action-Drama-Thriller-War
304 9.863 304 Ice Storm, The (1997) Drama
301 9.824 301 L.A. Confidential (1997) Crime-Film-Noir-Mystery-Thriller
In [ ]:
user_recommendations2(softmax_model, DOT, exclude_rated=True, k=20)
dot score movie_id titles genres
110 7.007 110 Truth About Cats & Dogs, The (1996) Comedy-Romance
221 6.836 221 Star Trek: First Contact (1996) Action-Adventure-Sci-Fi
293 6.607 293 Liar Liar (1997) Comedy
454 6.602 454 Jackie Chan's First Strike (1996) Action
120 6.549 120 Independence Day (ID4) (1996) Action-Sci-Fi-War
... ... ... ... ...
124 6.056 124 Phenomenon (1996) Drama-Romance
474 5.975 474 Trainspotting (1996) Drama
596 5.964 596 Eraser (1996) Action-Thriller
741 5.942 741 Ransom (1996) Drama-Thriller
762 5.942 762 Happy Gilmore (1996) Comedy

20 rows × 4 columns

In [ ]:
# @title Solution

def build_softmax_model(rated_movies, embedding_cols, hidden_dims):
  """Builds a Softmax model for MovieLens.
  Args:
    rated_movies: DataFrame of traing examples.
    embedding_cols: A dictionary mapping feature names (string) to embedding
      column objects. This will be used in tf.feature_column.input_layer() to
      create the input layer.
    hidden_dims: int list of the dimensions of the hidden layers.
  Returns:
    A CFModel object.
  """
  def create_network(features):
    """Maps input features dictionary to user embeddings.
    Args:
      features: A dictionary of input string tensors.
    Returns:
      outputs: A tensor of shape [batch_size, embedding_dim].
    """
    # Create a bag-of-words embedding for each sparse feature.
    inputs = tf.feature_column.input_layer(features, embedding_cols)
    # Hidden layers.
    input_dim = inputs.shape[1].value
    for i, output_dim in enumerate(hidden_dims):
      w = tf.get_variable(
          "hidden%d_w_" % i, shape=[input_dim, output_dim],
          initializer=tf.truncated_normal_initializer(
              stddev=1./np.sqrt(output_dim))) / 10.
      outputs = tf.matmul(inputs, w)
      input_dim = output_dim
      inputs = outputs
    return outputs

  train_rated_movies, test_rated_movies = split_dataframe(rated_movies)
  train_batch = make_batch(train_rated_movies, 800)
  test_batch = make_batch(test_rated_movies, 800)

  with tf.variable_scope("model", reuse=False):
    # Train
    train_user_embeddings = create_network(train_batch)
    train_labels = select_random(train_batch["label"])
  with tf.variable_scope("model", reuse=True):
    # Test
    test_user_embeddings = create_network(test_batch)
    test_labels = select_random(test_batch["label"])
    movie_embeddings = tf.get_variable(
        "input_layer/movie_id_embedding/embedding_weights")

  test_loss = softmax_loss(
      test_user_embeddings, movie_embeddings, test_labels)
  train_loss = softmax_loss(
      train_user_embeddings, movie_embeddings, train_labels)
  _, test_precision_at_10 = tf.metrics.precision_at_k(
      labels=test_labels,
      predictions=tf.matmul(test_user_embeddings, movie_embeddings, transpose_b=True),
      k=10)

  metrics = (
      {"train_loss": train_loss, "test_loss": test_loss},
      {"test_precision_at_10": test_precision_at_10}
  )
  embeddings = {"movie_id": movie_embeddings, "user_id": train_user_embeddings, "user_idtest": test_user_embeddings}
  return CFModel(embeddings, train_loss, metrics)

Train the Softmax model

We are now ready to train the softmax model. You can set the following hyperparameters:

  • learning rate
  • number of iterations. Note: you can run softmax_model.train() again to continue training the model from its current state.
  • input embedding dimensions (the input_dims argument)
  • number of hidden layers and size of each layer (the hidden_dims argument)

Note: since our input features are string-valued (movie_id, genre, and year), we need to map them to integer ids. This is done using tf.feature_column.categorical_column_with_vocabulary_list, which takes a vocabulary list specifying all the values the feature can take. Then each id is mapped to an embedding vector using tf.feature_column.embedding_column.

In [ ]:
# Create feature embedding columns
def make_embedding_col(key, embedding_dim):
  categorical_col = tf.feature_column.categorical_column_with_vocabulary_list(
      key=key, vocabulary_list=list(set(movies[key].values)), num_oov_buckets=0)
  return tf.feature_column.embedding_column(
      categorical_column=categorical_col, dimension=embedding_dim,
      # default initializer: trancated normal with stddev=1/sqrt(dimension)
      combiner='mean')

with tf.Graph().as_default():
  softmax_model = build_softmax_model(
      rated_movies,
      embedding_cols=[
          make_embedding_col("movie_id", 35),
          make_embedding_col("genre", 3),
          make_embedding_col("year", 2),
      ],
      hidden_dims=[35])

softmax_model.train(
    learning_rate=8., num_iterations=3000, optimizer=tf.train.AdagradOptimizer)
 iteration 3000: train_loss=5.488488, test_loss=5.862362, test_precision_at_10=0.011802
Out[ ]:
({'test_loss': 5.862362, 'train_loss': 5.4884877},
 {'test_precision_at_10': 0.011801607797400866})

Inspect the embeddings

We can inspect the movie embeddings as we did for the previous models. Note that in this case, the movie embeddings are used at the same time as input embeddings (for the bag of words representation of the user history), and as softmax weights.

In [ ]:
user_recommendations2(reg_model, DOT, exclude_rated=True, k=10)
dot score movie_id titles genres
171 4.730 171 Empire Strikes Back, The (1980) Action-Adventure-Drama-Romance-Sci-Fi-War
173 4.514 173 Raiders of the Lost Ark (1981) Action-Adventure
256 4.471 256 Men in Black (1997) Action-Adventure-Comedy-Sci-Fi
209 4.006 209 Indiana Jones and the Last Crusade (1989) Action-Adventure
297 3.975 297 Face/Off (1997) Action-Sci-Fi-Thriller
78 3.890 78 Fugitive, The (1993) Action-Thriller
407 3.875 407 Close Shave, A (1995) Animation-Comedy-Thriller
402 3.869 402 Batman (1989) Action-Adventure-Crime-Drama
384 3.848 384 True Lies (1994) Action-Adventure-Comedy-Romance
317 3.490 317 Schindler's List (1993) Drama-War
In [ ]:
user_recommendations2(softmax_model, DOT, exclude_rated=True, k=10)
dot score movie_id titles genres user_id
318 11.326 318 Everyone Says I Love You (1996) Comedy-Musical-Romance 318
324 10.867 324 Crash (1996) Drama-Thriller 324
323 10.689 323 Lost Highway (1997) Mystery 323
332 10.438 332 Game, The (1997) Mystery-Thriller 332
874 10.299 874 She's So Lovely (1997) Drama-Romance 874
990 10.156 990 Keys to Tulsa (1997) Crime NaN
326 10.071 326 Cop Land (1997) Crime-Drama-Mystery 326
244 10.000 244 Devil's Own, The (1997) Action-Drama-Thriller-War 244
304 9.863 304 Ice Storm, The (1997) Drama 304
301 9.824 301 L.A. Confidential (1997) Crime-Film-Noir-Mystery-Thriller 301
In [ ]:
user_recommendations2(reg_model, DOT, exclude_rated=True, k=10)
dot score movie_id titles genres user_id
171 4.730 171 Empire Strikes Back, The (1980) Action-Adventure-Drama-Romance-Sci-Fi-War 171
173 4.514 173 Raiders of the Lost Ark (1981) Action-Adventure 173
256 4.471 256 Men in Black (1997) Action-Adventure-Comedy-Sci-Fi 256
209 4.006 209 Indiana Jones and the Last Crusade (1989) Action-Adventure 209
297 3.975 297 Face/Off (1997) Action-Sci-Fi-Thriller 297
78 3.890 78 Fugitive, The (1993) Action-Thriller 78
407 3.875 407 Close Shave, A (1995) Animation-Comedy-Thriller 407
402 3.869 402 Batman (1989) Action-Adventure-Crime-Drama 402
384 3.848 384 True Lies (1994) Action-Adventure-Comedy-Romance 384
317 3.490 317 Schindler's List (1993) Drama-War 317
In [ ]:
movie_neighbors(softmax_model, "Aladdin", DOT)
movie_neighbors(softmax_model, "Aladdin", COSINE)
Nearest neighbors of : Aladdin (1992).
[Found more than one matching movie. Other candidates: Aladdin and the King of Thieves (1996)]
dot score titles genres
94 11.483 Aladdin (1992) Animation-Children-Comedy-Musical
171 10.899 Empire Strikes Back, The (1980) Action-Adventure-Drama-Romance-Sci-Fi-War
172 10.849 Princess Bride, The (1987) Action-Adventure-Comedy-Romance
167 10.596 Monty Python and the Holy Grail (1974) Comedy
203 10.595 Back to the Future (1985) Comedy-Sci-Fi
... ... ... ...
21 10.324 Braveheart (1995) Action-Drama-War
257 10.244 Contact (1997) Drama-Sci-Fi
587 10.227 Beauty and the Beast (1991) Animation-Children-Musical
256 10.216 Men in Black (1997) Action-Adventure-Comedy-Sci-Fi
110 10.189 Truth About Cats & Dogs, The (1996) Comedy-Romance

12 rows × 3 columns

Nearest neighbors of : Aladdin (1992).
[Found more than one matching movie. Other candidates: Aladdin and the King of Thieves (1996)]
cosine score titles genres
94 1.000 Aladdin (1992) Animation-Children-Comedy-Musical
81 0.833 Jurassic Park (1993) Action-Adventure-Sci-Fi
587 0.830 Beauty and the Beast (1991) Animation-Children-Musical
27 0.809 Apollo 13 (1995) Action-Drama-Thriller
172 0.788 Princess Bride, The (1987) Action-Adventure-Comedy-Romance
... ... ... ...
171 0.753 Empire Strikes Back, The (1980) Action-Adventure-Drama-Romance-Sci-Fi-War
264 0.749 Hunt for Red October, The (1990) Action-Thriller
431 0.739 Fantasia (1940) Animation-Children-Musical
209 0.737 Indiana Jones and the Last Crusade (1989) Action-Adventure
90 0.721 Nightmare Before Christmas, The (1993) Children-Comedy-Musical

12 rows × 3 columns

In [ ]:
movie_embedding_norm([reg_model, softmax_model])
Out[ ]:
In [ ]:
tsne_movie_embeddings(softmax_model)
Running t-SNE...
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 1682 samples in 0.000s...
[t-SNE] Computed neighbors for 1682 samples in 0.090s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1682
[t-SNE] Computed conditional probabilities for sample 1682 / 1682
[t-SNE] Mean sigma: 0.206902
[t-SNE] KL divergence after 250 iterations with early exaggeration: 54.926720
[t-SNE] KL divergence after 400 iterations: 1.314252
Out[ ]:
In [ ]:
tsne_movie_embeddings(reg_model)
Running t-SNE...
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 1682 samples in 0.001s...
[t-SNE] Computed neighbors for 1682 samples in 0.114s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1682
[t-SNE] Computed conditional probabilities for sample 1682 / 1682
[t-SNE] Mean sigma: 0.250537
[t-SNE] KL divergence after 250 iterations with early exaggeration: 58.184166
[t-SNE] KL divergence after 400 iterations: 1.497088
Out[ ]:
In [ ]:
user_recommendations(softmax_model, measure=COSINE, k=10)
cosine score movie_id titles genres
714 0.800 714 To Die For (1995) Comedy-Drama
238 0.769 238 Sneakers (1992) Crime-Drama-Sci-Fi
87 0.767 87 Sleepless in Seattle (1993) Comedy-Romance
65 0.765 65 While You Were Sleeping (1995) Comedy-Romance
195 0.765 195 Dead Poets Society (1989) Drama
50 0.750 50 Legends of the Fall (1994) Drama-Romance-War-Western
217 0.741 217 Cape Fear (1991) Thriller
156 0.735 156 Platoon (1986) Drama-War
450 0.734 450 Grease (1978) Comedy-Musical-Romance
158 0.726 158 Basic Instinct (1992) Mystery-Thriller

Congratulations!

You have completed this Colab notebook.

If you would like to further explore these models, we encourage you to try different hyperparameters and observe how this affects the quality of the model and the structure of the embedding space. Here are some suggestions:

  • Change the embedding dimension.
  • In the softmax model: change the number of hidden layers, and the input features. For example, you can try a model with no hidden layers, and only the movie ids as inputs.
  • Using other similarity measures: In this Colab notebook, we used dot product $d(u, V_j) = \langle u, V_j \rangle$ and cosine $d(u, V_j) = \frac{\langle u, V_j \rangle}{\|u\|\|V_j\|}$, and discussed how the norms of the embeddings affect the recommendations. You can also try other variants which apply a transformation to the norm, for example $d(u, V_j) = \frac{\langle u, V_j \rangle}{\|V_j\|^\alpha}$.
In [ ]: